Document Summarization finds the most important information in a document and creates an abridged version.
“The ideal of automatic summarization work is to develop techniques by which a machine can generate summaries that successfully imitate summaries generated by human beings.” (1)
Summaries should read fluently and stand on their own while preserving key information. (2)
Techniques for extractive summarization
The model builds a graph from the document, then summarizes it by considering the relations between the nodes (text units). The graph is constructed by creating a vertex for each sentence in the document.
Based on the PageRank algorithm used by the Google search engine. The basic premise is that a page has higher value if it is linked to by many other highly valued pages. A damping factor is included to prevent pages with no outgoing links from absorbing the PageRank of connected pages. The example we went over had a damping factor of 0.
TextRank treats words or sentences as the “pages” in the PageRank graph, and a similarity measure supplies the transition matrix. The measure is based on the number of words two sentences have in common, normalized by the sentences’ lengths.
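The word-overlap similarity and the PageRank-style scoring described above can be sketched as follows. This is a minimal illustration, not a reference implementation; the function names and the example sentences are our own, and it assumes every sentence has at least two words so the log normalization is nonzero.

```python
import math

def similarity(s1, s2):
    """TextRank-style similarity: number of shared words, normalized
    by the log of each sentence's length (assumes length >= 2)."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2)
    if overlap == 0:
        return 0.0
    return overlap / (math.log(len(w1)) + math.log(len(w2)))

def textrank(sentences, d=0.85, iterations=50):
    """Score sentences by PageRank-style power iteration over the
    weighted similarity graph (d is the damping factor)."""
    n = len(sentences)
    sim = [[0.0 if i == j else similarity(a, b)
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    row_sums = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - d) + d * sum(sim[j][i] / row_sums[j] * scores[j]
                                    for j in range(n) if row_sums[j] > 0)
                  for i in range(n)]
    return scores

sentences = [
    "Germany won the match against Argentina",
    "Germany beat Argentina in the final match",
    "The weather was sunny on Saturday",
]
scores = textrank(sentences)
```

The two match-related sentences reinforce each other through their shared words, so they outrank the weakly connected weather sentence.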
Uses cosine similarity of TF-IDF vectors
First, a graph of all the sentences in the corpus is generated. Each sentence is a node, and the edges represent the similarity relationships between sentences. The following formulation is used to measure the similarity.
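In LexRank this is the idf-modified cosine similarity between two sentences \(x\) and \(y\), which can be written as:

\[ \mathrm{idf\mbox{-}cosine}(x, y) = \frac{\sum_{w \in x, y} \mathrm{tf}_{w,x}\, \mathrm{tf}_{w,y}\, (\mathrm{idf}_w)^2}{\sqrt{\sum_{x_i \in x} (\mathrm{tf}_{x_i,x}\, \mathrm{idf}_{x_i})^2} \times \sqrt{\sum_{y_i \in y} (\mathrm{tf}_{y_i,y}\, \mathrm{idf}_{y_i})^2}} \]

where \(\mathrm{tf}_{w,s}\) is the number of occurrences of word \(w\) in sentence \(s\) and \(\mathrm{idf}_w\) is the inverse document frequency of \(w\).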
This measures the similarity between two sentences (x and y); the more similar the sentences, the higher the value.
The similarity measure is then used to build a similarity matrix, which can be viewed as a similarity graph between sentences. The LexRank algorithm measures the importance of each sentence in the graph by considering its importance relative to its neighboring sentences: a positive contribution raises the importance of a sentence’s neighbors, while a negative contribution lowers it.
To extract the most important sentences from the resulting similarity matrix, we apply a thresholding mechanism. A threshold value filters out the relationships between sentences whose weights fall below it. The result is a subset of the similarity graph, from which we can pick the node with the highest degree. This node is considered salient and represents a summary sentence of the corpus. (6)
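The thresholding and degree-centrality step can be sketched as below. For brevity this uses cosine similarity over raw term counts rather than the TF-IDF weighting LexRank proper uses, and the function names, threshold value, and example corpus are our own choices.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words count vectors
    (raw counts here; LexRank uses TF-IDF weights)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

def salient_sentence(sentences, threshold=0.7):
    """Drop edges whose similarity falls below the threshold, then
    return the sentence with the highest degree in what remains."""
    n = len(sentences)
    degree = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and cosine(sentences[i], sentences[j]) > threshold:
                degree[i] += 1
    return sentences[max(range(n), key=lambda i: degree[i])]

corpus = [
    "The man walked the dog",
    "The man took the dog to the park",
    "The dog went to the park",
]
```

On this toy corpus the middle sentence survives with the most above-threshold edges, so it is selected as the salient summary sentence.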
The model extracts features from each sentence, then evaluates its importance.
Some examples of features are word frequency, sentence position, sentence length, and the presence of title or cue words:
Luhn proposed that the most frequent words represent the most important concepts of a text. His process gives each sentence a score based on the number of occurrences of these significant words and chooses the sentences with the highest scores.
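A Luhn-style frequency scorer can be sketched as follows. This is a simplified illustration: the stopword list, the cutoff of five significant words, and the naive period-based sentence split are all assumptions of ours, not part of Luhn's original procedure.

```python
from collections import Counter

def luhn_summary(text, stopwords, top_n=1):
    """Score each sentence by its count of significant (most frequent,
    non-stopword) words and return the top_n highest-scoring sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    words = [w for w in text.lower().replace(".", " ").split()
             if w not in stopwords]
    # Take the five most frequent content words as "significant".
    significant = {w for w, _ in Counter(words).most_common(5)}
    def score(sentence):
        return sum(1 for w in sentence.lower().split() if w in significant)
    return sorted(sentences, key=score, reverse=True)[:top_n]

text = ("The dog chased the ball. "
        "The dog caught the ball in the park. Rain fell.")
best = luhn_summary(text, stopwords={"the", "in"})
```

The sentence packing the most high-frequency words ("dog", "ball", etc.) wins, which is the core of Luhn's idea.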
Calculates the topics of the document and evaluates each sentence by the topics it includes.
LSA takes an input matrix and performs singular value decomposition (SVD) to identify patterns in the relationships and similarities between the terms and sentences. LSA summarization algorithms usually contain three main steps.
Step 1: Create the input matrix, in which each row represents a word and each column a sentence. The cell value represents the importance of the word and can be filled using word frequency, TF-IDF, log entropy, etc.
Step 2: SVD breaks up the input matrix into three new matrices as follows: \[ A = USV^T \] where A is the original m by n input matrix; U is words x extracted concepts; S is a diagonal matrix of scaling (singular) values in descending order; and V is sentences x extracted concepts.
Step 3: Sentence selection by using the results of SVD.
Ex: Three sentences are given as an input to LSA
S1: “The man walked the dog”
S2: “The man took the dog to the park”
S3: “The dog went to the park”
Using the Gong and Liu algorithm for sentence selection, we inspect the \(V^T\) matrix:
\[\begin{array}{r|rrr} & S1 & S2 & S3 \\ \hline \text{Concept 1} & 0.457 & \mathbf{0.728} & 0.510 \\ \text{Concept 2} & -0.770 & 0.037 & 0.637 \end{array} \] In the \(V^T\) matrix, row order indicates the importance of the concepts, so the first row corresponds to the most important concept. In this example, sentence 2 is determined to contain the most important concept.
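The first row of \(V^T\) above can be reproduced in pure Python, assuming the input matrix holds raw term counts over the vocabulary in order of first appearance (the, man, walked, dog, took, to, park, went). The first row of \(V^T\) is the dominant right singular vector of A, i.e. the dominant eigenvector of \(A^T A\), which power iteration finds:

```python
import math

# Term-by-sentence frequency matrix for S1-S3 (raw counts assumed).
A = [
    [2, 3, 2],  # the
    [1, 1, 0],  # man
    [1, 0, 0],  # walked
    [1, 1, 1],  # dog
    [0, 1, 0],  # took
    [0, 1, 1],  # to
    [0, 1, 1],  # park
    [0, 0, 1],  # went
]

n = len(A[0])
# Gram matrix G = A^T A; its dominant eigenvector is row 1 of V^T.
G = [[sum(A[k][i] * A[k][j] for k in range(len(A))) for j in range(n)]
     for i in range(n)]
v = [1.0] * n
for _ in range(100):  # power iteration
    w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

concept1 = [round(x, 3) for x in v]  # → [0.457, 0.728, 0.510]
```

The largest entry is in the S2 column, so Gong and Liu's method selects sentence 2, matching the table above.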
LSA has several limitations. The first drawback is that it has difficulty handling words whose meaning depends on context. The second is that it does not use information about word order, syntactic relations, or morphology. Lastly, with larger and more inhomogeneous data, the performance of LSA decreases sharply because of the cost of SVD.
Parses the text, constructs its grammatical structure, and then reorders the substructures. Analyzing grammatical structure helps preserve the meaning of a phrase.
Example: “In the disputed area of East Timor”
It may help to imagine Extractive techniques as highlighting words, phrases, or sentences in order to summarize the document, while Abstractive techniques are more like using a pen to write a summary in your own words. Having computers write an abstractive summary proves to be much more complicated than extractive techniques, but methods like Recurrent Neural Networks have the potential to be extremely powerful!
Recurrent Neural Networks (RNN) are a type of neural network that performs calculations on sequential data (like words in a sentence). (7)
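The core of a vanilla RNN is a single update rule applied word by word: the new hidden state is a nonlinear function of the current input and the previous hidden state, \(h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1})\). The tiny sketch below uses made-up weights and placeholder word vectors purely to show the mechanics; a real summarizer would learn these weights from data.

```python
import math

def rnn_step(x, h, W_xh, W_hh):
    """One vanilla RNN step: h_t = tanh(W_xh . x_t + W_hh . h_{t-1})."""
    return [math.tanh(sum(W_xh[i][j] * x[j] for j in range(len(x))) +
                      sum(W_hh[i][j] * h[j] for j in range(len(h))))
            for i in range(len(h))]

# Toy setup: 2-dim word vectors, 3-dim hidden state (weights are arbitrary).
W_xh = [[0.1, 0.2], [0.0, 0.1], [0.2, -0.1]]
W_hh = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]

sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # placeholder embeddings
h = [0.0, 0.0, 0.0]
for x in sentence:
    h = rnn_step(x, h, W_xh, W_hh)  # hidden state accumulates context
```

After the loop, `h` is a fixed-size summary of the whole sequence; an encoder-decoder summarizer feeds such a state into a second RNN that generates the summary words.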
Let's examine a quick example of an RNN for a simple sentence:
“Germany emerge victorious in 2-0 win against Argentina on Saturday”
Now we can look at the diagram below to see how an RNN might process this sentence to summarize it.
Once the encoder has read in the text, the decoder RNN begins to find words and place them in sequence based on what the RNN was trained on. In this case, the RNN picks up the starting subject (Germany) and uses the words victorious and win to come up with the word beat as the next part of the summary. The process proceeds through the sentence until a proper summary has been produced.
It may seem like it's doing a pretty good job summarizing the document; however, two large problems present themselves: Problem 1 - The summaries sometimes reproduce factual details inaccurately (such as the score of the soccer match) because these are rare words or words that do not appear in the network's vocabulary.
Problem 2 - The RNN can sometimes get caught in a loop, repeating the same words or phrase over and over again (e.g. Germany beat Germany beat Germany beat…) (7)
If you want additional information on how RNNs get around these problems, please see the online article from which this information and diagram were collected here. This article does an excellent job explaining the basics of document summarization using Recurrent Neural Networks.
The topic of Recurrent Neural Networks is extremely vast, and they are a potentially powerful tool for producing detailed abstractive summaries when applied properly.
Document summarization is very important and relevant today due to the vast number of documents available. While people may not have the time to read through and summarize each document, computers can do it much more quickly via text mining. Each technique is better suited to summarizing different types of documents. Next class we will get into some examples in R on how to use this software to summarize documents. While the software does most of the heavy lifting, data preprocessing is vital for the packages to run properly.